Learning Neural Audio Embeddings for Grounding Semantics in Auditory Perception

نویسندگان

  • Douwe Kiela
  • Stephen Clark
چکیده

Multi-modal semantics, which aims to ground semantic representations in perception, has relied on feature norms or raw image data for perceptual input. In this paper we examine grounding semantic representations in raw auditory data, using standard evaluations for multi-modal semantics. After having shown the quality of such auditorily grounded representations, we show how they can be applied to tasks where auditory perception is relevant, including two unsupervised categorization experiments, and provide further analysis. We find that features transfered from deep neural networks outperform bag of audio words approaches. To our knowledge, this is the first work to construct multi-modal models from a combination of textual information and auditory information extracted from deep neural networks, and the first work to evaluate the performance of tri-modal (textual, visual and auditory) semantic models.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Deep embodiment: grounding semantics in perceptual modalities

Multi-modal distributional semantic models address the fact that text-based semantic models, which represent word meanings as a distribution over other words, suffer from the grounding problem. This thesis advances the field of multi-modal semantics in two directions. First, it shows that transferred convolutional neural network representations outperform the traditional bag of visual words met...

متن کامل

Multi- and Cross-Modal Semantics Beyond Vision: Grounding in Auditory Perception

Multi-modal semantics has relied on feature norms or raw image data for perceptual input. In this paper we examine grounding semantic representations in raw auditory data, using standard evaluations for multi-modal semantics, including measuring conceptual similarity and relatedness. We also evaluate cross-modal mappings, through a zero-shot learning task mapping between linguistic and auditory...

متن کامل

Multimodal Transfer Deep Learning with Applications in Audio-Visual Recognition

We propose a transfer deep learning (TDL) framework that can transfer the knowledge obtained from a single-modal neural network to a network with a different modality. Specifically, we show that we can leverage speech data to fine-tune the network trained for video recognition, given an initial set of audio-video parallel dataset within the same semantics. Our approach first learns the analogyp...

متن کامل

Computing auditory perception*

In this paper the ingredients of computing auditory perception are reviewed. On the basic level there is neurophysiology, which is abstracted to artificial neural nets (ANNs) and enhanced by statistics to machine learning. There are high-level cognitive models derived from psychoacoustics (especially Gestalt principles). The gap between neuroscience and psychoacoustics has to be filled by numer...

متن کامل

Multiple Instance Deep Learning for Weakly Supervised Small-Footprint Audio Event Detection

State-of-the-art audio event detection (AED) systems rely on supervised learning using strongly labeled data. However, this dependence severely limits scalability to large-scale datasets where fine resolution annotations are too expensive to obtain. In this paper, we propose a multiple instance learning (MIL) framework for multi-class AED using weakly annotated labels. The proposed MIL framewor...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • J. Artif. Intell. Res.

دوره 60  شماره 

صفحات  -

تاریخ انتشار 2017